OpenStreetMap Data Wrangling with MongoDB

Project Overview

In this project, I select map data for an area from OpenStreetMap and use data wrangling techniques to assess the validity, accuracy, completeness, consistency, and uniformity of the data. I then clean the dataset, upload it to MongoDB, and run my own analysis to find some interesting insights.

1. Download the data.

The area I chose to look at is Kenmore, Washington, the city I live in. I downloaded the dataset from OpenStreetMap via the Overpass API. I will apply the techniques learned from Udacity's Data Wrangling with MongoDB course to explore, audit, and clean this dataset, then convert the XML to JSON.

In [47]:
from IPython.display import Image
Image("kenmore.png", width ="600")
Out[47]:

run step1_download_data.py

2. Audit the data.

Before adding the data to MongoDB, we need to check whether there are any data quality issues. I will parse the file with ElementTree, starting with a count of the different element types.

Running step2_auditingdata.py gives:

{'bounds': 1, 'member': 8145, 'meta': 1, 'nd': 341142, 'node': 303563, 'note': 1, 'osm': 1, 'relation': 325, 'tag': 194947, 'way': 33921}
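The counts above can be produced with an incremental parse so the ~68M file never has to fit in memory at once. The following is a minimal sketch, not the exact script; the function name `count_tags` and the filename `kenmore.osm` are illustrative:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def count_tags(filename):
    """Tally every element tag that appears in the OSM XML file."""
    counts = defaultdict(int)
    # iterparse yields elements one at a time instead of building the full tree
    for _, elem in ET.iterparse(filename):
        counts[elem.tag] += 1
        elem.clear()  # free the element after counting to keep memory flat
    return dict(counts)
```

Calling `count_tags("kenmore.osm")` would return a dictionary like the one shown above.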

Since the tag keys are what I am most interested in, I then looped through all tag keys and printed their counts.

Running step2_auditingdata.py gives:

[('addr:housenumber', 69105), ('addr:street', 69060), ('addr:postcode', 68970), ('addr:city', 68868), ('building', 68445), ('source', 40278), ('highway', 32385), ('name', 19917), ('addr:unit', 12804), ('tiger:county', 12669), ('tiger:cfcc', 12597), ('tiger:name_base', 12009), ('tiger:reviewed', 8835), ('tiger:name_direction_suffix', 6696), ('service', 5166), ('tiger:name_direction_prefix', 4854), ('power', 4614), ('lanes', 4398), ('maxspeed', 3792), ('tiger:name_type', 3405), ('amenity', 2925), ('ref', 2664), ('oneway', 2577), ...]

Finally, I also audited whether there were any problematic tag keys that would not be valid in MongoDB. I printed the number of keys of each type and found no problematic keys.

Running step2_auditingdata.py gives:

{'lower': 72762, 'lower_colon': 120222, 'other': 1963, 'problemchars': 0}
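The four buckets above can be produced by matching each tag key against three regular expressions. This is a sketch following the Udacity course conventions; the exact patterns in the script may differ slightly:

```python
import re
import xml.etree.ElementTree as ET

# all-lowercase keys, lowercase keys with one colon, and keys containing
# characters that would be problematic in MongoDB field names
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    """Classify one <tag> element's key into the four buckets."""
    if element.tag == "tag":
        k = element.attrib['k']
        if lower.search(k):
            keys['lower'] += 1
        elif lower_colon.search(k):
            keys['lower_colon'] += 1
        elif problemchars.search(k):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1
    return keys
```

Running `key_type` over every element in the file accumulates the totals shown above.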

3. Problems in the dataset

3.1 Street Name

The first issue I found was inconsistency in the street names: some use abbreviations while others do not. I updated the abbreviations to their full names, and for names ending with a unit number I moved the number to the front. Here are two examples:

15th Ave Northeast => 15th Avenue Northeast
Lake City Way Northeast #183 => No 183 Lake City Way Northeast

The process is as follows:

1. Create a regex for finding abbreviations, along with the list of expected full names and a mapping from each abbreviation to its full name.
2. Audit each value of a <tag> with key "addr:street"; if the value does not match the expected pattern, update it using the mapping dictionary.
3. Save the file with the updated street names.
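The update step can be sketched as follows. The `mapping` shown here is a subset for illustration (the real dictionary covers more abbreviations), and the helper name `update_name` is assumed rather than taken from the script:

```python
import re

# abbreviation -> full name (illustrative subset)
mapping = {'Ave': 'Avenue', 'St': 'Street', 'Pl': 'Place',
           'NE': 'Northeast', 'SW': 'Southwest'}

def update_name(name, mapping):
    """Expand abbreviations and move a trailing '#123' unit to the front."""
    # "... #183" -> "No 183 ..."
    m = re.match(r'^(.*)\s+#(\w+)$', name)
    if m:
        name = 'No %s %s' % (m.group(2), m.group(1))
    # expand every abbreviated word we know about
    return ' '.join(mapping.get(w, w) for w in name.split())
```

For example, `update_name('233rd St SW', mapping)` yields `'233rd Street Southwest'`.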

Running step3to5_audit_and_convert.py (step 3), we get:

124th Avenue Northeast #A => No A 124th Avenue Northeast
15TH Ave Northeast => 15TH Avenue Northeast
Juanita-Woodinville Way Northeast => Juanita Woodinville Way Northeast
Pinehurst Pocket Park;19th Avenue Northeast => Pinehurst Pocket Park 19th Avenue Northeast
15th Ave Northeast => 15th Avenue Northeast
Lake City Way Northeast #183 => No 183 Lake City Way Northeast
233rd St SW => 233rd Street Southwest
25th Ave NE => 25th Avenue Northeast
138th Way NE => 138th Way Northeast
18006 120TH AVE NE => 18006 120TH Avenue Northeast
104th Ave NE => 104th Avenue Northeast
141st Pl NE => 141st Place Northeast
Ballinger Way NE => Ballinger Way Northeast
Bothell Way NE => Bothell Way Northeast
120th Avenue NE => 120th Avenue Northeast
NE 158th St => Northeast 158th Street
NE 186th St => Northeast 186th Street
Main St => Main Street
NE Bothell Way => Northeast Bothell Way
Bothell-Everett Highway => Bothell Everett Highway

These are the names that were updated.

After updating all incorrect street names, I saved the new file as 'kenmore_updated_street.xml'.

Auditing the new file again confirmed there were no street names left to update. Street name auditing done.


3.2 City Name

I also found inconsistencies in the city names: some are misspelled and some are not capitalized. I updated the city names to fix these issues. The process is as follows:

1. Create a list of the correct city names in the map area.
2. Audit each "addr:city" value and list all distinct city names in the file.
3. Because there are only 13 distinct city names in total, it is easy to spot the misspellings and uncapitalized names. Update them to the correct name and format.
4. Save the new file with the city names updated.
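Step 3 can be sketched with simple conditional logic. The helper name `update_city` is illustrative; the rules below cover the misspellings and capitalization problems found in the audit:

```python
# correct city names for the map area (illustrative list)
expected_cities = {'Bothell', 'Edmonds', 'Kenmore', 'Kirkland',
                   'Lake Forest Park', 'Mountlake Terrace', 'Seattle',
                   'Shoreline', 'Woodinville'}

def update_city(name):
    """Fix the misspellings and capitalization issues found in the audit."""
    if name in expected_cities:
        return name
    if name in ('BOTHELL', 'Bothel'):
        return 'Bothell'
    if name == 'Woodenville':
        return 'Woodinville'
    return name.title()  # e.g. 'kenmore' -> 'Kenmore'
```

With only 13 distinct values, an explicit rule per bad value is simpler and safer than trying to guess corrections automatically.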

Running step3to5_audit_and_convert.py (step 4), we get:

{'BOTHELL', 'Bothel', 'Bothell', 'Edmonds', 'Kenmore', 'Kirkland', 'Lake Forest Park', 'Mountlake Terrace', 'Seattle', 'Shoreline', 'Woodenville', 'Woodinville', 'kenmore'}

Pulling all city names in the file yields only 13 distinct values, so it is easy to audit them and make corrections with simple if-else logic.

After updating all incorrect city names, I saved the new file as 'kenmore_updated_streetcity.xml'.

Auditing the new file again confirmed there were no city names left to update. City name auditing done.


4. Overview of the Data

4.1 Convert from XML to JSON format

The following is the process:

1. Process two types of top-level tags: node and way.
2. Turn all attributes of node and way into key/value pairs.
3. Ignore problematic tag keys.
4. Add tag keys starting with "addr:" to an address dictionary.
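The steps above can be sketched as a per-element shaping function. This is an illustrative version, not the exact script: the `pos` field and the handling of second-level colon keys are assumptions following the course's target schema, while the `created` sub-document matches the `created.user` field queried later:

```python
import xml.etree.ElementTree as ET

CREATED = ['version', 'changeset', 'timestamp', 'user', 'uid']

def shape_element(element):
    """Turn one <node> or <way> element into a JSON-ready dictionary."""
    if element.tag not in ('node', 'way'):
        return None
    node = {'type': element.tag, 'created': {}}
    for attr, value in element.attrib.items():
        if attr in CREATED:
            node['created'][attr] = value
        elif attr in ('lat', 'lon'):
            continue  # collected below as a [lat, lon] pair
        else:
            node[attr] = value
    if 'lat' in element.attrib:
        node['pos'] = [float(element.attrib['lat']),
                       float(element.attrib['lon'])]
    for tag in element.iter('tag'):
        k, v = tag.attrib['k'], tag.attrib['v']
        if k.startswith('addr:') and k.count(':') == 1:
            node.setdefault('address', {})[k[5:]] = v
        elif ':' not in k:
            node[k] = v
        # keys with other colons are dropped in this sketch
    refs = [nd.attrib['ref'] for nd in element.iter('nd')]
    if refs:
        node['node_refs'] = refs
    return node
```

Each non-None result is written out as one line of JSON for mongoimport.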

Running step3to5_audit_and_convert.py (step 5) performs the conversion.

4.2 Upload the JSON file to MongoDB

I used pymongo to create the database and the subprocess package to run shell commands to import the JSON file.

Size of the two files

The following code outputs the sizes of the JSON and OSM files:

In [20]:
import re
from subprocess import check_output, Popen, PIPE

osm = re.sub(r'[\s]', ' ', check_output("du -h kenmore.osm".split()))
json = re.sub(r'[\s]', ' ', check_output("du -h final.json".split())); print(osm + '\n' + json)
 68M kenmore.osm 
 67M final.json 

The JSON file is 67M and the original OSM file is 68M.

In [2]:
import signal
import subprocess
import os


pro = subprocess.Popen("mongod", preexec_fn = os.setsid) 
In [3]:
from pymongo import MongoClient
import bson

db_name = "udacity_proj"

client = MongoClient()
db = client[db_name]
In [5]:
collection = "osm_kenmore"
json_file = "final.json"

mongoimport_cmd = "mongoimport --db " + db_name + \
                  " --collection " + collection + \
                  " --file " + json_file

# drop collection if it exists
if collection in db.collection_names():
    print "dropping collection"
    db[collection].drop()

# Execute the command
print "Executing: " + mongoimport_cmd
subprocess.call(mongoimport_cmd.split())
Executing: mongoimport --db udacity_proj --collection osm_kenmore --file final.json
Out[5]:
0
In [6]:
data_db = db[collection]

4.3 Some statistics about the dataset

Number of Documents

In [7]:
data_db.find().count()
Out[7]:
337484

Number of Nodes

In [8]:
data_db.find({"type":"node"}).count()
Out[8]:
303563

Number of Ways

In [9]:
data_db.find({"type":"way"}).count()
Out[9]:
33921

Number of unique users

In [11]:
len(data_db.distinct("created.user"))
Out[11]:
337

Top 10 contributors

In [24]:
for doc in data_db.aggregate([
                            {"$match": {'created.user': {"$exists": 1} }},
                             {"$group": {"_id": '$created.user', "count": {"$sum": 1}}},
                             {"$sort": {'count': -1}},
                             {"$limit": 10}
                            ]):
    print doc
{u'count': 74061, u'_id': u'Glassman_Import'}
{u'count': 72466, u'_id': u'SeattleImport'}
{u'count': 67797, u'_id': u'seattlefyi_import'}
{u'count': 20312, u'_id': u'Glassman'}
{u'count': 13008, u'_id': u'bdp'}
{u'count': 12764, u'_id': u'STBrenden'}
{u'count': 9851, u'_id': u'Natfoot'}
{u'count': 7220, u'_id': u'sctrojan79-import'}
{u'count': 6785, u'_id': u'Debbie Bull'}
{u'count': 6014, u'_id': u'compdude'}

Top 10 amenities

In [42]:
for doc in data_db.aggregate([
                            {"$match": {'amenity': {"$exists": 1} }},
                             {"$group": {"_id": '$amenity', "count": {"$sum": 1}}},
                             {"$sort": {'count': -1}},
                             {"$limit": 10}
                            ]):
    print doc
{u'count': 313, u'_id': u'parking'}
{u'count': 71, u'_id': u'restaurant'}
{u'count': 67, u'_id': u'school'}
{u'count': 52, u'_id': u'fast_food'}
{u'count': 49, u'_id': u'place_of_worship'}
{u'count': 49, u'_id': u'bench'}
{u'count': 35, u'_id': u'cafe'}
{u'count': 34, u'_id': u'fuel'}
{u'count': 32, u'_id': u'bicycle_parking'}
{u'count': 25, u'_id': u'bank'}

Top street

In [30]:
for doc in data_db.aggregate([
                            {"$match": {'address.street': {"$exists": 1} }},
                             {"$group": {"_id": '$address.street', "count": {"$sum": 1}}},
                             {"$sort": {'count': -1}},
                             {"$limit": 1}
                            ]):
    print doc
{u'count': 524, u'_id': u'Northeast 132nd Street'}

4.4 Other thoughts on the dataset

Besides the statistics above, I am also interested in verifying which fast food restaurant is most popular in this area. As I remember it, Subway should have the most stores. I also want to see whether Starbucks and Bank of America both have more stores than their competitors here, just like in downtown Bellevue.

Through some aggregations, i found:

1. Subway has 8 stores, more than any other competitor.
2. Starbucks is dominant in the Kenmore area.
3. Bank of America has 6 stores in this area, the same as its biggest competitor, Chase.

These findings largely confirmed my assumptions, which is interesting.

In [45]:
for doc in data_db.aggregate([
                            {"$match": {'amenity': 'fast_food' }},
                             {"$group": {"_id": '$name', "count": {"$sum": 1}}},
                             {"$sort": {'count': -1}},
                             {"$limit": 5}
                            ]):
    print doc
{u'count': 8, u'_id': u'Subway'}
{u'count': 6, u'_id': u"McDonald's"}
{u'count': 3, u'_id': u'Taco Time'}
{u'count': 3, u'_id': u'Jack in the Box'}
{u'count': 2, u'_id': u"Domino's Pizza"}
In [38]:
for doc in data_db.aggregate([
                            {"$match": {'amenity': 'cafe' }},
                             {"$group": {"_id": '$name', "count": {"$sum": 1}}},
                             {"$sort": {'count': -1}},
                             {"$limit": 5}
                            ]):
    print doc
{u'count': 9, u'_id': u'Starbucks'}
{u'count': 2, u'_id': u'Brew'}
{u'count': 1, u'_id': u'Perfetto Espresso'}
{u'count': 1, u'_id': u'Cold Stone Creamery'}
{u'count': 1, u'_id': u'192 Brewing Co. espresso stand'}
In [37]:
for doc in data_db.aggregate([
                            {"$match": {'amenity': 'bank' }},
                             {"$group": {"_id": '$name', "count": {"$sum": 1}}},
                             {"$sort": {'count': -1}},
                             {"$limit": 5}
                            ]):
    print doc
{u'count': 6, u'_id': u'Bank of America'}
{u'count': 6, u'_id': u'Chase'}
{u'count': 3, u'_id': u'Wells Fargo'}
{u'count': 3, u'_id': u'Washington Federal'}
{u'count': 2, u'_id': u'US Bank'}

Conclusion

The map data for the Kenmore area is of good quality in general. However, there are some inconsistencies in street names, where some use abbreviations, as well as incorrect spellings in the city names. All of these issues were found by auditing and fixed by replacing them with the correct formats. The project showed me the importance of data quality and taught me how to QA data efficiently.
